Code
library(ggplot2)
library(tidyverse)
library(here)
library(janitor)
library(cowplot)
library(patchwork)
library(lubridate)
library(forcats)
library(stringr)
library(viridis)
library(tidymodels)
library(broom)
library(kableExtra)Sam Lance
Master of Environmental Science and Management at the The Bren School (UCSB), Advanced Data Analysis (ESM 244)
February 14, 2025
Data Description:
The Palmetto Dataset describes two species of palmetto Serenoa repens and Sabal etonia at the Archbold Biological Station in south-central Florida between 1981-1997 and again in 2001-2017 at five year intervals. The dataset measures many plant features, but this analysis will focus on canopy height, widest canopy length, widest canopy width, and the number of green leaves.
Data Source:
Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017, ongoing at 5-year intervals ver 1. Environmental Data Initiative.
Central Questions:
What variables help us to accurately predict whether a palmetto plant is of the species Serenoa repens or Sabal etonia?
Psuedocode:
Acquire + Clean Data
Perform Exploratory Data Analysis
Build + Select Models
Select model variables - length, width, height, # of green leaves
Train the model - train the model to understand the data
Model selection - c fold validation to pick which model is best
Finalize Model
Interpret the Results/ Visualize
palmetto_og <- read.csv(here("posts","2025-03-19-palmetto-244", "data", "palmetto.csv")) |>
mutate(species = as.factor(species))
palmetto_clean <- palmetto_og |>
mutate(species = ifelse(species %in% c(1, 2),
ifelse(species == 1, "S. repens", "S. etonia"), col1)) |>
select(species, height, width, length, green_lvs)height_plot <- palmetto_clean |>
ggplot(aes(x=species, y=height, fill=species)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option = "G") +
theme_classic() +
theme(legend.position="none",
plot.title = element_text(size=11),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10)) +
labs(x = NULL, y = "Height (cm)", title = NULL)length_plot <- palmetto_clean %>%
ggplot(aes(x=species, y=length, fill=species)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option = "G") +
theme_classic() +
theme(legend.position="none",
plot.title = element_text(size=11),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10)) +
labs(x = NULL, y = "Length (cm)", title = NULL)width_plot <- palmetto_clean %>%
ggplot(aes(x=species, y=width, fill=species)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option = "G") +
theme_classic() +
theme(legend.position="none",
plot.title = element_text(size=11),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10)) +
labs(x = NULL, y = "Width (cm)", title = NULL)leaves_plot <- palmetto_clean %>%
ggplot(aes(x=species, y=green_lvs, fill=species)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option = "G") +
theme_classic() +
theme(legend.position="none",
plot.title = element_text(size=11),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10)) +
labs(x = NULL, y = "Green Leaves (#)", title = NULL)Based on the exploratory data analysis conducted, while height and width do not vary much between species, the length and amount of green leaves are visually different.
The results of this regression indicate that for model 1 that includes variables height, length, width, and the number of green leaves all possible variables are significant. Model 2, which includes the same variables except length, indicates that only width and green leaves are significant.
Comparing the AIC values of each model, model 1 has a smaller AIC value, indicating it better describes the data presented.
Parameters Included: Plant height, canopy length, canopy width, and green leaves
Results: This model is 91% accurate, with a ROC value of 97%
set.seed(2980)
#splits dataset into 10 folds and repeats 10 times
folds_1 <- vfold_cv(palmetto_og, v = 10, repeats = 10)
#creates a regression model
blr_mdl_1 <- logistic_reg() |>
set_engine('glm')
#adds model and formula to the workflow
blr_wf_1 <- workflow() |> ### initialize workflow
add_model(blr_mdl_1) |>
add_formula(formula = f1)
#adds the folds to the workflow and stores the results
blr_fit_folds_1 <- blr_wf_1 |>
fit_resamples(folds_1)
#collect_metrics(blr_fit_folds_1)Parameters Included: Plant height, canopy width, and green leaves
Results: The model is 89% accurate, with a ROC value of 96%
set.seed(2980)
folds_2 <- vfold_cv(palmetto_og, v = 10, repeats = 10)
blr_mdl_2 <- logistic_reg() |>
set_engine('glm') ### this is the default - we could try engines from other packages or functions
blr_wf_2 <- workflow() |> ### initialize workflow
add_model(blr_mdl_1) |>
add_formula(formula = f2)
blr_fit_folds_2 <- blr_wf_2 |>
fit_resamples(folds_2)
#collect_metrics(blr_fit_folds_2)Model 1, which included plant height, canopy length, canopy width, and the amount of green leaves, was chosen for the remainder of this analysis due to its higher accuracy and greater ROC value, both of which indicate better model performance.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.2266851 | 0.1420708 | 22.71180 | 0 |
| height | -0.0292173 | 0.0023061 | -12.66984 | 0 |
| length | 0.0458233 | 0.0018661 | 24.55600 | 0 |
| width | 0.0394434 | 0.0021000 | 18.78227 | 0 |
| green_lvs | -1.9084747 | 0.0388634 | -49.10728 | 0 |
Model Success: Our model was able to predict the species given palmetto plant by its characteristics (height, length, width, and number of green leaves) between 90% and 92% of the time. This indicates the model has a high predictive accuracy.
accuracy_by_species <- palmetto_predict |>
select(species, .pred_class) |> #pick two columns to allow drop_na to work
drop_na() |> #drop na to allow the code to work
mutate(correct = ifelse(species == .pred_class, 1, 0)) |>
group_by(species) |> #group by actual species
summarise(
total_cases = n(), #count total cases per species
correct_predictions = sum(correct), #count the number of correct species predictions
incorrect_predictions = total_cases - correct_predictions, #count number of incorrect predictions
accuracy = (correct_predictions / total_cases) * 100) #find the accuracy
kable(accuracy_by_species, caption = "Model 1 Accuracy at Predicting Palmetto Species") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F)| species | total_cases | correct_predictions | incorrect_predictions | accuracy |
|---|---|---|---|---|
| 1 | 6112 | 5548 | 564 | 90.77225 |
| 2 | 6155 | 5701 | 454 | 92.62388 |